A Comprehensive Whole Genome Bacterial Phylogeny Using Correlated Peptide Motifs Defined in a High Dimensional Vector Space

نویسندگان

  • Gary W. Stuart
  • Michael W. Berry
چکیده

As whole genome sequences continue to expand in number and complexity, effective methods for comparing and categorizing both genes and species represented within extremely large datasets are required. Methods introduced to date have generally utilized incomplete and likely insufficient subsets of the available data. We have developed an accurate and efficient method for producing robust gene and species phylogenies using very large whole genome protein datasets. This method relies on multidimensional protein vector definitions supplied by the singular value decomposition (SVD) of a large sparse data matrix in which each protein is uniquely represented as a vector of overlapping tetrapeptide frequencies. Quantitative pairwise estimates of species similarity were obtained by summing the protein vectors to form species vectors, then determining the cosines of the angles between species vectors. Evolutionary trees produced using this method confirmed many accepted prokaryotic relationships. However, several unconventional relationships were also noted. In addition, we demonstrate that many of the SVD-derived right basis vectors represent particular conserved protein families, while many of the corresponding left basis vectors describe conserved motifs within these families as sets of correlated peptides (copeps). This analysis represents the most detailed simultaneous comparison of prokaryotic genes and species available to date.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A comprehensive vertebrate phylogeny using vector representations of protein sequences from whole genomes.

We recently developed a method for producing comprehensive gene and species phylogenies from unaligned whole genome data using singular value decomposition (SVD) to analyze character string frequencies. This work provides an integrated gene and species phylogeny for 64 vertebrate mitochondrial genomes composed of 832 total proteins. In addition, to provide a theoretical basis for the method, we...

متن کامل

Whole-Genome Sequencing of a Clinically Isolated Antibiotic-Resistant Enterococcus faecium EntfacYE

Background and Objective: Enterococcal infections are considered the most common nosocomial infections. Nowadays, enterococci show high resistance to common antibiotics, especially vancomycin. Vancomycin-resistant Enterococcus faecium is one of the most common nosocomial infections, which is included in the World Health Organization priority pathogens list for research and development of new an...

متن کامل

Improving Phylogeny Reconstruction at the Strain Level Using Peptidome Datasets

Typical bacterial strain differentiation methods are often challenged by high genetic similarity between strains. To address this problem, we introduce a novel in silico peptide fingerprinting method based on conventional wet-lab protocols that enables the identification of potential strain-specific peptides. These can be further investigated using in vitro approaches, laying a foundation for t...

متن کامل

Enhanced Solubility of Anti-HER2 scFv Using Bacterial Pelb Leader Sequence

Single chain Fragment variable (scFv) is an antibody fragment consisting variable regions of heavy and light chains. scFvs enhance their penetrability into tissues while maintaining specific affinity and having low immunogenicity. Insoluble inclusion bodies are formed when scFvs are expressed in reducing bacterial cytoplasm. One strategy for obtaining functionally active scFv is to translocate ...

متن کامل

A Whole Genome Phylogeny Using Truncated Pivoted QR Decomposition

The increasing availability of whole genome sequences in public databases has stimulated the development of new methods to automatically compare and categorize genes and species. Recently developed methods based on the singular value decomposition (SVD) allow for the simultaneous identification and definition of well concerved motifs and gene families using very large whole genome datasets. In ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Journal of bioinformatics and computational biology

دوره 1 3  شماره 

صفحات  -

تاریخ انتشار 2003